Analysis of approximate nearest neighbor searching with clustered point sets
Abstract
Nearest neighbor searching is a fundamental computational problem. A set of n data points is given in real d-dimensional space, and the problem is to preprocess these points into a data structure, so that given a query point, the nearest data point to the query point can be reported efficiently. Because data sets can be quite large, we are primarily interested in data structures that use only O(dn) storage. A popular class of data structures for nearest neighbor searching is the kd-tree and variants based on hierarchically decomposing space into rectangular cells. An important question in the construction of such data structures is the choice of a splitting method, which determines the dimension and splitting plane to be used at each stage of the decomposition. This choice of splitting method can have a significant influence on the efficiency of the data structure. This is especially true when data and query points are clustered in low-dimensional subspaces, because clustering can lead to subdivisions in which cells have very high aspect ratios. We compare the well-known optimized kd-tree splitting method against two alternative splitting methods. The first, called the sliding-midpoint method, attempts to balance the goals of producing subdivision cells of bounded aspect ratio and of not producing any empty cells. The second, called the minimum-ambiguity method, is a query-based approach. In addition to the data points, it is also given a training set of query points for preprocessing. It employs a simple greedy algorithm to select the splitting plane that minimizes the average amount of ambiguity in the choice of the nearest neighbor for the training points. We provide an empirical analysis comparing these two methods against the optimized kd-tree construction for a number of synthetically generated data and query sets.
We demonstrate that for clustered data and query sets, these algorithms can provide significant improvements over the standard kd-tree construction for approximate nearest neighbor searching. 1991 Mathematics Subject Classification. 68P10, 68W40.
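The abstract does not spell out the sliding-midpoint rule, so the following is only a minimal sketch of how such a split is commonly described: cut the cell's longest side at its midpoint, and if every point falls on one side, slide the plane toward the points until it touches the nearest one, so that no empty cell is produced. The function name `sliding_midpoint_split`, its signature, and the tie-breaking choices are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sliding_midpoint_split(
    points: List[Point], lo: Point, hi: Point
) -> Tuple[int, float, List[Point], List[Point]]:
    """Split a cell [lo, hi] holding `points` (hypothetical helper).

    Returns (dim, cut, left, right). Assumes at least two points.
    Points with coordinate == cut may end up on either side.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to split")

    # Assumption: always cut the longest side, which keeps aspect ratios low.
    dim = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
    cut = (lo[dim] + hi[dim]) / 2.0

    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]

    if not left:
        # All points lie on the high side: slide the plane up to the
        # smallest coordinate and give that single point to the low side.
        idx = min(range(len(points)), key=lambda i: points[i][dim])
        cut = points[idx][dim]
        left = [points[idx]]
        right = points[:idx] + points[idx + 1:]
    elif not right:
        # All points lie on the low side: slide the plane down to the
        # largest coordinate and give that single point to the high side.
        idx = max(range(len(points)), key=lambda i: points[i][dim])
        cut = points[idx][dim]
        right = [points[idx]]
        left = points[:idx] + points[idx + 1:]

    return dim, cut, left, right
```

For a cluster packed into one corner of the unit square, the midpoint cut would leave one child empty, so the sketch slides the plane to the cluster's edge instead. The resulting tree can be deep and unbalanced, but every cell it creates contains at least one point.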
Similar resources
The Analysis of a Probabilistic Approach to Nearest Neighbor Searching
We are given a set S of n data points in some metric space. Given a query point q in this space, a nearest neighbor query asks for the nearest point of S to q. Throughout we will assume that the space is real d-dimensional space ℝ^d, and the metric is Euclidean distance. The goal is to preprocess S into a data structure so that such queries can be answered efficiently. Nearest neighbor searching has ap...
Approximate Nearest Neighbor Searching in Multimedia Databases
In this paper, we develop a general framework for approximate nearest neighbor queries. We categorize the current approaches for nearest neighbor query processing based on either their ability to reduce the data set that needs to be examined, or their ability to reduce the representation size of each data object. We first propose modifications to well-known techniques to support the progressive ...
Nearest Neighbor Search using Kd-trees
We suggest a simple modification to the kd-tree search algorithm for nearest neighbor search, resulting in improved performance. The kd-tree data structure seems to work well in finding nearest neighbors in low dimensions, but its performance degrades even when the number of dimensions increases to more than three. Since the exact nearest neighbor search problem suffers from the curse of dimensi...
Implementing a Parallel Dynamic Approximate Nearest Neighbor Search Algorithm
We describe the implementation of a fast, dynamic, approximate nearest-neighbor search algorithm that works well in fixed dimensions (d ≤ 5), based on sorting point coordinates in Morton (or z-) ordering. Our code scales well on multi-core/CPU shared-memory systems. Our implementation is competitive with the best approximate nearest neighbor searching codes available on the web, especially fo...
The Area Code Tree for Approximate Nearest Neighbour Search in Dense Point Sets
In this paper, we present an evaluation of nearest neighbour searching using the Area Code tree. The Area Code tree is a trie-type structure that organizes area code representations of each point of interest (POI) in a data set. This data structure provides a fast method for locating an actual or approximate nearest neighbour POI for a query point. We first summarize the area code generation, i...